It is a well known fact and it has been observed in many real life data that the first digit of the data is not distributed evenly among all possible first digits 1 to 9. Rather the observed distribution for the digit i is
This is called Benford's law.
To test this law, I stole stock data from an site, and listed the daily trade volume of their stock into a file.
>v=readmatrix("volumes.dat")';
Here are the first 100 values.
>v[1:200]
[544157, 544157, 156075, 11862, 265267, 1.71407e+006, 479429, 1.68337e+006, 22709, 130265, 42516, 41107, 6995, 149473, 191975, 62920, 42462, 39557, 63524, 16511, 865995, 48299, 69100, 109735, 86305, 84685, 237742, 122700, 153005, 137054, 73998, 18846, 55636, 68715, 23939, 16670, 149444, 129363, 290037, 6582, 187479, 8488, 26133, 81853, 26185, 25048, 80988, 213087, 881591, 3.18527e+006, 117500, 86400, 113609, 18719, 1.05462e+006, 1.77765e+007, 1.71189e+007, 650646, 1.47722e+006, 270910, 456247, 67000, 355126, 489772, 916030, 1859, 95172, 118442, 125030, 172168, 101264, 265043, 93265, 588767, 3.66473e+006, 2.10848e+006, 147272, 452850, 2.51024e+006, 129672, 599131, 1.17062e+006, 43360, 3.53408e+006, 2.15011e+006, 261405, 10500, 41615, 1.01669e+006, 127458, 226756, 101700, 2.7435e+006, 274591, 291125, 14340, 59780, 178830, 1.21082e+006, 128018, 856704, 378086, 664557, 405999, 828502, 335948, 576086, 604670, 134000, 83000, 654082, 812007, 639507, 3.24893e+006, 268568, 3.93623e+006, 9.15216e+006, 8.02835e+006, 1.8367e+006, 221005, 386657, 1.57159e+006, 68764, 58385, 896267, 30670, 5.0522e+006, 93478, 3.66215e+006, 279015, 5650, 726680, 136385, 106236, 490220, 304752, 1.52216e+006, 1.95674e+006, 1.97216e+006, ... ]
To find the first digit of a number in its decimal representation, we can use the following function.
>function fd(x) ... n=floor(log10(x)); return floor(x/10^n) endfunction
The expected distribution is the following.
>d=log10(2:10)-log10(1:9)
[0.30103, 0.176091, 0.124939, 0.09691, 0.0791812, 0.0669468, 0.0579919, 0.0511525, 0.0457575]
Let us plot the distribution of the first digits in our data versus the expected distribution.
>columnsplot(getmultiplicities(1:9,fd(v))/length(v)); ... plot2d(1:9,d,>add); ... plot2d(1:9,d,>points,>add); ... insimg;
It is obvious that Benford's law is a better description of the reality than the assumption of an equal distribution.
Let us perform a statistical test to check the distribution. We first need the frequencies of the first digits in v.
>frequ=count(fd(v)-1,9)
[976, 530, 350, 267, 195, 178, 175, 152, 124]
With an equal distribution, we expect about 327 cases for each digit.
>nf=sum(frequ), fe=nf/9
2947 327.444444444
The chi-square test fails, of course.
>chitest(frequ,ones(1,9)*fe)
0
However, the chi-test fails also for our expected distribution, though not as badly as before.
>chitest(frequ,d*nf)
0.00887027918168
One could argue, that the distribution of the values is exponential. But this is not really the case. A logarithmic distribution of the random variable X would satisfy
If we compute the logarithms for a cumulative frequency count of the values y, we see, that it is by no means linear.
>{x,y}=histo(v,20); plot2d(x,log(cumsum(flipx(y))),bar=true):
Let us compare this with an exponential distribution with the same mean value.
>m=mean(v)
469396.739396
The following Monte Carlo simulation produces such a distribution.
>vt=-log(random(1,100000))*m; mean(vt)
467459.444843
>{x,y}=histo(vt,20); plot2d(x,log(cumsum(flipx(y))),>bar):
Let us compare the distributing of the first digits we found in our sample with an exponential distribution. For this, we produce a sequence of first digits, which are distributed according to Benford's law.
>va=fd(10^random(1,nf));
We insert these into the same plot as above.
>columnsplot(getmultiplicities(1:9,fd(va))/length(va)); ... plot2d(1:9,d,>add); ... plot2d(1:9,d,>points,>add); ... insimg;
>frequa=count(fd(va)-1,9)
[834, 526, 368, 304, 232, 225, 171, 166, 121]
Most likely, the chi-square test cannot reject the distribution d for our Monte Carlo data.
>chitest(frequa,d*nf)
0.183718883574
A more general way to express Benford's law is to state that the fractions of the decadic logarithm of the data is distributed evenly in [0,1].
We can plot the distribution of these fractions for our data.
>plot2d(mod(log10(v),1),>distribution):
The result looks very good.